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Abstract This paper addresses the problem of Monte Carlo approximation of 
posterior probability distributions. In particular, we have considered a recently 
proposed technique known as population Monte Carlo (PMC), which is based 
on an iterative importance sampling approach. An important drawback of this 
methodology is the degeneracy of the importance weights when the dimension 
of either the observations or the variables of interest is high. To alleviate this 
difficulty, we propose a novel method that performs a nonlinear transforma- 
tion on the importance weights. This operation reduces the weight variation, 
hence it avoids their degeneracy and increases the efficiency of the importance 
sampling scheme, specially when drawing from a proposal functions which are 
poorly adapted to the true posterior. 

For the sake of illustration, we have applied the proposed algorithm to 
the estimation of the parameters of a Gaussian mixture model. This is a very 
simple problem that enables us to clearly show and discuss the main features 
of the proposed technique. As a practical application, we have also consid- 
ered the popular (and challenging) problem of estimating the rate parameters 
of stochastic kinetic models (SKM). SKMs are highly multivariate systems 
that model molecular interactions in biological and chemical problems. We 
introduce a particularization of the proposed algorithm to SKMs and present 
numerical results. 
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1 Introduction 

The problem of performing inference in high dimensional spaces appears in 
many practical applications. For example, it is of increasing interest in the 
biological sciences to develop new techniques that allow for the efficient esti- 
mation of the parameters governing the behavior of complex autoregulatory 
networks. The main difficulty often encountered when tackling this kind of 
problems is the design of numerical inference algorithms which are stable and 
have a guaranteed convergence also in high-dimensional spaces. 

A very common strategy, which has been successfully applied in a broad 
variety of complex problems, is the Monte Carlo methodology. In particu- 
lar, we have considered a recently proposed technique known as population 
Monte Carlo (PMC) [7J, which is based on an iterative importance sampling 
approach. The aim of this method is the approximation of static probability 
distributions by way of discrete random measures consisting of samples and 
associated weights. The target distribution is often the posterior distribution 
of a set variables of interest, given some observed data. 

The main advantages of the PMC scheme, compared to the widely estab- 
lished Markov chain Monte Carlo (MCMC) methodology, are the possibility 
of developing parallel implementations, the sample independence and the fact 
that an unbiased estimate is provided at each iteration, which avoids the need 
of a convergence period. 

On the contrary, an important drawback of the importance sampling ap- 
proach, and particulary of PMC, is that its performance heavily depends on 
the choice of the proposal distribution (or importance function) that is used 
to generate the samples and compute the weights. When the dimension of the 
variables of interest is large, or the target probability density function (pdf) 
is very sharp with respect to the proposal (this occurs when, e.g., the num- 
ber of observations is high or when the prior distribution is uninformative) , 
the importance weights degenerat^ leading to an extremely low number of 
representative samples. This problem is commonly known as weight degen- 
eracy and is closely related to the "curse of dimensionality" . To the best of 
our knowledge, the degeneracy problem has not been successfully addressed 
so far in the PMC framework. The issue was already mentioned in the original 
paper [7] , though, noting that the proposed scheme did not provide neither a 
stabilization of the so called effective sample size, nor of the variance of the 
importance weights. 

The effort in the field of PMC algorithms has been mainly directed to- 
ward the design of efficient proposal functions. A recently proposed scheme 
for the proposal update is the mixture PMC technique [5], that models the 
importance functions as mixtures of kernels. Both the weights and the inter- 
nal parameters of each mixture component are adapted along the iterations 
with the goal of minimizing the Kullback-Leiber divergence between the target 



1 In this context, degeneracy means that the vast majority of importance weights become 
negligible (practically zero) except for a very small number of them 1 1 II 1 5 1 ■ 
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density and the proposal. However, this scheme also suffers from degeneracy 
unless applied to very simple examples and the authors of [B] propose to ap- 
ply an additional Rao-Blackwellization scheme to mitigate the problems that 
appear in multidimensional problems. 

Another recently proposed PMC scheme [17] is based on the Gibbs sam- 
pling method and constructs the importance functions as alternating con- 
ditional densities. Thus, this method allows to sample efficiently from high- 
dimensional proposals. However, the importance weights still present severe de- 
generacy due to the extreme values of the likelihood function in high-dimensional 
spaces, unless the number of samples is extremely high. The technique is based 
on the multiple marginalized PMC, proposed in [4], [19] . 

In this paper we propose a novel PMC method, termed nonlinear PMC 
(NPMC). The emphasis is not placed on the proposal update scheme, which 
can be very simple (here we restrict ourselves to importance functions chosen 
as multivariate normal denisities whose parameters are adjusted along the it- 
erations to match the moments of the latest approximation of the posterior 
distribution) . The main feature of the technique is the application of a nonlin- 
ear transformation to the importance weights in order to reduce their varia- 
tions. In this way, the efficiency of the sampling scheme is improved (specially 
when drawing from "poor" proposals) and, most importantly, the degeneracy 
of the weights is drastically mitigated even when the number of generated sam- 
ples is relatively small. We show analytically that the approximate integrals 
computed using the transformed weights converge asymptotically to the true 
integrals. 

In this work we have successfully applied the proposed methodology to 
two different problems where severe weight degeneracy is observed. The first 
example is a Gaussian mixture model, already discussed in [7J. We have used 
this example to illustrate the degeneracy problem and for comparison of the 
original PMC scheme with the proposed methods. We provide numerical re- 
sults that suggest that degeneracy of the importance weights takes places even 
in simple and low-dimensional problems, when the proposal pdf is wide with 
respect to the target. The proposed method clearly outperforms the original 
PMC scheme and provides a high number of representative samples and very 
accurate estimates of the variables of interest. 

As a practical application, we have also applied the proposed algorithm to 
the popular (and challenging) problem of estimating the rate parameters in 
stochastic kinetic models [3]. Such models describe the time evolution of the 
population of a set of species or chemical reactions, which evolve according 
to a set of constant rate parameters, and present an autoregulatory behavior. 
This problem is currently of great interest in a broad diversity of biological 
and molecular problems, such as complex auto-regulatory gene networks. 

We introduce a particularization of the generic NPMC algorithm to SKMs, 
which tackles the evaluation of the likelihood function (which can not be com- 
puted exactly) using a particle filter. As a simple and intuitive, yet physi- 
cally meaningful example, we have obtained numerical results for the Lotka- 
Volterra model, also known as predator-prey model, consisting of two inter- 
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acting species related by three reactions with associated unknown rates. The 
proposed method provides a very good performance also in this scenario, where 
standard PMC algorithms can be easily shown to fail. 

The rest of the paper is organized as follows. In Section[2]we introduce some 
notation. In Section [31 we give a formal statement of the class of problems ad- 
dressed in this paper. In Section 21 the population Monte Carlo algorithm is 
presented and a discussion of the weight degeneracy problem is provided. In 
section [5] the proposed algorithm is described. In section [6] we provide a theo- 
retical convergence analysis of the new methodology. In section [7] we present 
numerical results on a Gaussian mixture model. We illustrate the effects of 
degeneracy and the performance of the proposed algorithms in this scenario. 
In Section [FJ we describe the practical application of the proposed algorithm 
to the problem of estimating the constant rates of a stochastic kinetic model, 
and present numerical results. Section [9] is devoted to the conclusions. 



2 Notation 

In this section we introduce some notations that are used through the paper. 

— All vectors are assumed to have column form. We denote them using bold- 
face lower-case letters, e.g., 6, y, and its A:-th scalar component is written as 
the corresponding normal-face letter with a subscript, e.g., 0k, yu- Matrices 
are denoted by boldface upper-case letters, e.g., S. N. 

— We use M. K , with integer K > 1, to denote the set of if -dimensional vectors 
with real entries. 

— B(R K ) is the set of bounded real functions over M. K . In particular, the 
supremum norm of a real function / : M. K — > M. is denoted as ||/||oo = 

|/(z)| and / e B(R K ) when H/IU < oo. 

— The target pdf of a PMC algorithm is denoted as ir, the proposal density 
as q and the rest of pdf 's as p. 

— We write conditional pdf's asp(y|0), and joint densities asp(8) = p{0\, . . . , Ok)- 
This is an argument-wise notation, hence p(0\) denotes the distribution of 

9i, possibly different from p{02), which represents the distribution of 02- 

— The integral of a function / with respect to a measure fi is denoted by 
the shorthand (/, /i) = / f(0)fi(d9). If the measure has a density tt, i.e., 
[i(d6) = ir(6)d6, then we also write (/,tt) = J f(6)-K(0)d9. 

— £j>(6>) [/(#)] denotes expectation of the function / with respect to the prob- 
ability distribution with density p{6). 

— A sample from the distribution of the random vector 6 is denoted by 
Sets of M samples {0 (1) , . . . , (M) } are denoted as {6^}^. 

— 6g(i)(d8) is the unit delta measure located at 9 = 9^ l K 

— Unweighted samples are denoted with a tilde as j, opposite to the 

corresponding weighted samples 

— For a random value X, F{X < a} denotes the probability of the event 
X < a. 
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— In the chemical representation the notation ax must not be interpreted as 
a product, but rather is read as "a molecules of species x" . 



3 Problem Statement 

Let 9 — [0i, ... , 8k] T be a vector of K unobserved real random variables with 
prior density p(9) and let y = [yi, . . . , 2/jv] T be a vector of TV real random 
observations related to 9 by way of a conditional pdf p(y\0). 

In this paper we address the problem of approximating the posterior prob- 
ability distribution of 9, i.e., the distribution with density p(9\y), using a ran- 
dom grid of M points, {9^ } ££ l5 in the space of the random vector 9. Once the 
grid is generated, it is simple to approximate any moments of p(9\y). For ex- 
ample, the posterior mean of 9 given the observations y may be approximated 
as 

M 

E pW y)[0]= / 9p(9\y)d9K-Y,0 {t) - 

Unfortunately, the generation of useful samples that represent the proba- 
bility measure p(9\y)d9 adequately when K (or N) is large is normally a very 
difficult task. The main goal of this work is to devise and assess an efficient 
computational inference (Monte Carlo) methodology for the approximation of 
p(9\y)d9 and its moments, i.e., expectations of the form -E p (t>| y ) [/(0|y)], where 
/ : R K — > R is some integrable function of 9. 



4 Population Monte Carlo 

4.1 Importance Sampling 

One of the main applications of statistical Monte Carlo methods is the ap- 
proximation of integrals of the form 



(/,tt)= J f(9)n{9)d9, 



where / is a real, integrable function of 9 and it (9) is some pdf of interest 
(often termed the target density). Given the random sample we 
build a random discrete measure 

M 



M 

i=l 



that enables a straightforward approximation (/, tt ) of {f,ir), namely 

r i M 

(f,n) « (/,tt m ) = j f(9)n M (d9) = ^E/( 0W ) • 
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Under mild assumptions, it can be shown that [18j 

lim (f,n M ) = (f,n) a.s. 

M— >oo 

However, in many practical cases it is not possible to sample from tt(9) 
directly. A common approach to overcome this difficulty is to apply an im- 
portance sampling (IS) procedure [TS]. The key idea is to draw the samples 
{9^}f£ 1 from a (simpler) proposal pdf, or importance function, q(6), and 
then compute associated importance weights of the form 



7r(0W) 



which we subsequently normalize to yield 



w^ = — !S , i = l,...,M. 

The integral (/, n) is then approximated by the weighted sum 

M 

c/,^)=x;« w /(tf w ). 

i=l 

The efficiency of an IS algorithm depends heavily on the choice of the 
proposal, q{9). However, in order to ensure the asymptotic convergence of the 
approximation (/, 7r M ), when M is large enough, it is sufficient to select q(0) 
such that q{0) > whenever tt(9) > [T5]. 

Finally, note that the computation of the normalized weights requires that 
both ir(8) and q(0) can be evaluated up to a proportionality constant inde- 
pendent of 6. This requirement is a mild one, since in most problems it is 
possible to evaluate some function h(9) cx tt(0). 



4.2 Population Monte Carlo Algorithm 

The population Monte Carlo (PMC) method [7] is an iterative IS scheme 
that seeks to generate a sequence of proposal pdf's qe(9), £ = 1, . . . ,L, such 
that every new proposal is closer (in some adequate sense to be defined) to 
the target density tt(9) than the previous importance function. Such scheme 
demands, therefore, the ability to learn about the target tv(8), given the set 
of samples and weights at the (£ — l)-th iteration, in order to produce the 
new proposal qi(9) for the £-th iteration. Taking this ability for granted, the 
algorithm is simple and can be outlined as shown in Table [T] 
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Table 1 Generic PMC algorithm 

Iteration (I = Q, .. .,L): 

1. Select a proposal pdf qi(0). 

— If n(0) oc p(6\y)p(0) , at iteration I = the proposal may be selected as the prior 
q o (e)=p(0). 

— At iterations £ = I, . . . , L, the proposal pdf qt(0) must be adapted according to the 

f (i) d) I M 
weighted sample y0\_ 1 ,w i _ 1 j at the previous iteration. 

2. Draw a collection of M i.i.d. (independent and identically distributed) samples <9* f = 
\ef) M from<fc(0). 

^ J i=l 

3. Compute the normalized weights 



,M. 



4. Perform a resampling step according to the weights u>^ to create an unweighted sample 
setef = {^}^. 



At every iteration of the algorithm it is possible to compute, if needed, an 

M 
e 



estimate (f^Trf 1 ) of (/, it) as 



(/, 7 rf)=E^ ) /(^ ) ) 



i=l 



and, if the proposals qi(9) are actually improved across iterations, then it can 
be expected that the error |(/, n) — (/, 7r^ J )| also decreases with I. 

In problems of the type described in Section [31 the target density is the 
posterior pdf of 6, i.e., it(0) = p(0\y) oc p(y\Q)p(0), where p(y\0) is the like- 
lihood function of the variable 9 given the observations y. A straightforward 
way of initializing the algorithm is to use the prior as the starting proposal, 
q o (0)=p(9). 

A frequently used index for the performance of Monte Carlo sampling 
methods is the effective sample size (ESS) M e// and its normalized version 
(NESS) M neff , respectively defined as [18] 

1 A4 e f f 

M e// = — ^, and M neff = 



Eti W 



M 



These indices may be used to quantitatively monitor the convergence of the 
PMC algorithm and to stop the adaptation when the proposal has converged 
to the target. Thus, ideally we expect to observe an increase in both measures 
along the iterations, with M ne ^ approaching unity as the algorithm converges. 

However, unless the proposal pdf is well tailored to the target density, the 
resulting importance weights will often present very large variations, leading 
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to a low number of effective samples. This problem is well known to affect IS 
schemes and is usually termed as the degeneracy of the weights [lOlfTS] . 

4.3 Degeneracy of the importance weights 

The degeneracy of the importance weights is a problem that arises when the 
normalized importance weights iuM, i = 1, M, of a set of samples {6^}fL x 
present large fluctuations and the maximum of the weights, max,; w^' , is close 
to one, leading to an extremely low number of representative samples (i.e., 
samples with non negligible weights). This situation occurs when the target 
and the proposal densities are approximately mutually singular, i.e., they have 
(essentially) have disjoint support. This problem has been addressed in the sce- 
nario of very large scale systems [2j and it has been proved that the maximum 
of the sample weights converges to one if the sample size M is sub-exponential 
in the system dimension K. 

The degeneracy of the weights as the dimension of the system increases is 
an intuitive fact and has been widely accepted as one of the main drawbacks of 
IS schemes. Indeed, The degeneracy of the weights makes the application of IS 
to high dimensional systems (those with a large value of K) computationally 
intractable. However, not only a large value of K leads to degeneracy. Indeed, 
it can be easily verified (numerically) that existing PMC methods can suffer 
from degeneracy even when applied to very simple (low dimensional) systems. 
Assume that the target pdf is the posterior, ir(9) oc p(6\y)p(9), and consider a 
set of M samples {0^}ff. 1 drawn from the prior pdf p{6), which is the case at 
the first iteration of the PMC algorithm. Assuming conditionally independent 
observations, the importance weight associated to the i-th sample is given by 

N 

ocp(y|0«) = Y[p(y n \0 {l) ) , * = 1,...,M. (1) 

71 = 1 

Thus, the importance weights are obtained from a likelihood consisting of the 
product of a potentially large number of factors. This structure can lead to 
large fluctuations in the normalized importance weights and a very low number 
of effective samples. In fact, the degeneracy problem was already identified in 
the original work introducing the PMC methodology. The specific algorithm 
for which simulations are displayed in [7] can be shown to present sharp vari- 
ations in the effective sample size and a large variance of the estimators as 
well. However, to the best of our knowledge, no systematic solution has been 
provided so far for this problem. 

A parallelism exists between high dimension and high number of observa- 
tions, both leading to computational problems. On one hand, as the dimension 
K increases, clearly the chance to obtain a representative sample 9^ % ' decreases, 
since the state space is larger. On the other hand, as the number of observa- 
tions increases, the probability concentrates in a smaller region (the posterior 
pdf is sharper as a consequence, e.g., of a structure such as in Eq. [T]), which 
again leads to a low probability of obtaining representative samples. 
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Therefore, while the degeneracy of the weights increases critically with 
K [2], in low dimensinal systems it is mainly motivated by a high number 
of observations N , unless the computational inference method is explicitly 
designed to account for this difficulty. In section [7] we present numerical re- 
sults to support this claim, which provides a rationale to understand the poor 
performance of existing PMC algorithms in certain low dimensional models. 

In this paper we introduce a new methodology to tackle the weight de- 
generacy problem, either due to large K or to large AT. The key feature of 
the method is the application of a nonlinear transformation to the importance 
weights, in order to reduce their variations and avoid degeneracy. As a result, 
we obtain a number of effective samples that is large enough to adequately 
perform the proposal update and provide consistent estimates of the variables 
of interest. The new technique is introduced in Section [5] below. 

5 Algorithms 

In this section we describe the proposed algorithm, which is termed nonlinear 
PMC (NPMC). We adopt a very simple proposal update scheme, where the 
importance functions are multivariate Gaussian pdf 's with moments matched 
to our latest approximation of the posterior distribution. The key feature is the 
application of a nonlinear transformation of the importance weights. Besides 
the basic version of the algorithm, we propose an adaptive version where this 
transformation is only applied when the value of the effective sample size is 
below a certain threshold. Finally, we explore different forms of the weight 
transformation. 

5.1 Nonlinear PMC 

Assume, in the sequel, that the target pdf is the posterior density, i.e., it(0) oc 
p(6\y)p(8). For simplicity we select the importance functions in the PMC 
scheme as multivariate normal (MVN) densities. In the £-th iteration 



where hi is the mean vector and Si is a positive definite covariance marix. 
The parameters of the Gaussian proposal are chosen to match the moments of 
the distribution described by the discrete measure ontained after the (£ — 1)- 
th iteration of the PMC algorithm. In particular, we compute the mean and 
covariance as 



qi (0)=N(0;n t ,E t ), £ = 1,...,L, 




M 



and 



(2) 




(3) 



i=l 
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Table 2 Nonlinear PMC algorithm 
Iteration (1 = 0,.. .,L): 

1. Draw a collection of M samples Of 1 = | from the proposal density qe(0). 

— At iteration £ = 0, the proposal density is the prior pdf qo{6) = p(0). 

— At iterations I = 1, . . . , L the proposal is a MVN pdf qi(0) = Af(0; fJ.£, Si), where 
the mean and covariance are computed according to Eqs. J2} and l|3|l. 

2. Compute the unnormalized PMC weights 

(t) . p^Iy) p(y\ef) P (ef) 

= V / ex _V 7 4t^ i = l.--..Af. 



3. Perform non-linear transformations ipf 1 on the weights in order to smooth their varia- 
tions, that is, 

= V f («;«*) , i = l,..,M. 

4. Normalize the modified importance weights, 

ujW _ _J — i = l,...,A/. 

2^=1 «^ 

5. Resampling: for i = 1, . . . , M, let 0^' = 9^ with probability tu^', j = 1, . . . , M. We 
obtain (9? f . 



Let us remark that this particular proposal update scheme is not a constraint of 
the algorithm. It is actually independent of the weight update and resampling 
steps and can be designed as freely as in the standard PMC methodology. 

The key modification of the algorithm is the computation of the impor- 

(i) -U) 
tance weights. Given a variate Q\ , its asociated weight is computed as w\ oc 

ipf 1 (w^* , where wf 1 * = p{d\y)p{ff) / q{.{9^ is the standard unnormalized im- 
portance weight and ipf 1 : (0, oc) — > (0, oo) is a nonlinear positive function 
that may depend both on the iteration index £ and the number of sam- 
ples M. This nonlinearity should be chosen so as to reduce the variation 
of the resulting normalized weights, wf 1 — w^* / J^jLi ■ Intuitively, it 
should preserve the ordering of the samples (those with larger standard weights 
should also have the largest transformed weights) while reducing the difference 
maxi wjp — mini wj 1 ' or some other measure of weight variation. The proposed 
generic algorithm is outlined in Table [3] 

Step 5 of the NPMC method involves multinomial resampling, which con- 
sists in sampling with replacement from the set {0g with probabilities 

equal to the associated weights wf\ to obtain an unweighted set {of 
This is obviously not the only choice of resampling algorithm and we use it 
only for the sake of simplicity. See, e.g., [5] for an overview of resampling 
techniques. 



Population Monte Carlo with transformed weights 



11 



At each iteration £ = 0, . . . , L, we obtain random discrete approximations 
of the posterior distribution with density it{9). Indeed, the discrete measures 

M 

nf I (d9) = i M )> and 

M 

^{dd) = -Y / ^( de ) ( 4 ) 

1=1 

are, both of them, approximations of ir(9)d0. In particular, if / : M. K — > M. is 
an arbitrary integrable function of 9, then the integral (/, 7r) = J f (9)ir{9)d9 
can be approximated as either 

M 

8=1 

1 M 

i=i 

Note that the estimator (/, n¥) involves one extra Monte Carlo step (be- 
cause of resampling) and, hence, it can be shown to have more variance than 
(/, irf' 1 ) [9]. Therefore, we assume in the sequel that estimates are computed 
by way of the measure ivf 1 unless explicitly stated otherwise. 

Note as well that, since tt(9) oc p(9\y)p(9) , any expectation with respect to 
the posterior distribution is actually an integral with respect to the measure 
ir(9)d9, i.e., £p(6>| y [/(#)] = (/: tO, and, therefore, it can be approximated using 
Tif , namely, E p{ely [f(9)] = (/, Tff ). 

5.2 Modified NPMC 

The nonlinear transformation ipf 1 is mainly useful at the first iterations of 
the PMC scheme, when the proposal density does not fit the target density 
and the standard importance weights may display high variability. However, in 
some applications it may be possible to remove the nonlinear transformation 
after a few iterations, when the proposal is closer to the target. 

Table [3] displays a modification of the NPMC algorithm which consists 
in applying the nonlinear transformation only if the ESS is below a specified 
threshold. The modified algorithm only differs in steps 3 and 4 from the generic 
NPMC procedure, and they are outlined in the table. 

5.3 Selecting the transformation of the importance weights 

Several possibilities exist for the choice of the nonlinearity (pf 1 . In this section 
we describe and intuitively justify two specific functions based on the "tem- 
pering" and the "clipping" , respectively, of the standard importance weights. 
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Table 3 Modified NPMC algorithm 

Steps 3 and 4 of the NPMC algorithm arc replaced by the following computations: 
3. Compute the normalized importance weights ui. 1 ' and the corresponding ESS M R J ^ 



M 



off 



4. If M* < M^^ n , smooth and normalize the weights 

4> = ( M W>) , «(«) = , i = 1, . . ■ , Af. 

Otherwise, set ftj?' = tt)»^. 



5.5.i Tempering 

In this case, the nonlinear transformation consists in an exponentiation of 
the standard weights. In particular, the expression of the transformed weights 

W}' = Ift \w\ I = \w £ I , 1 = 1,..., M, 

where < 7£ < 1. The sequence j£, I = 1, . . . , L, has to be adapted along 
the iterations, taking low values at the first steps and getting closer to 1 (an, 
eventually, ji = 1) as the algorithm converges. 

A straightforward a priori choice for the 7^ sequence is a polinomial func- 
tion of £, e.g., 7f cx £ m , or a sigmoid function = 1 J e - t ■ However, this 
choices do not guarantee that the resulting ESS be large enough. While in 
simple examples this procedure provides a remarkable reduction of the weight 
variations and an increase of the ESS, in more complex problems it is not 
enough to guarantee a stable and consistent convergence, as will be shown nu- 
merically in Sections [7] and [5J In such cases, the sequence jg may be computed 
adaptively, according to the values taken by the standard importance weights 
wf\t = l,...,M. 

This methodology can be interpreted as a generalization of the simulated 
tempering process applied to the target density, which has been widely studied 
in the MCMC literature, [TB]. However, to the best of our knowledge, it has 
not been applied in iterative IS approaches. The technique can also related to 
the one proposed in |14) . which introduces a sequence of models 

tt £ (0) = Pe (9\y) a Pe (y\e)p(9), 1=1,..., L 

that converges to the true posterior pdf p(6\y) along the iterations. These 
models are constructed in such a way that they are simpler for Monte Carlo 
approximation, since they have a broader likelihood and partially avoid de- 
generacy. 
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5.3.2 Clipping 

In problems where the degeneracy of the weights is severe it is often difficult to 
apply a tempering procedure to soften the weights and to obtain a reasonable 
ESS. The softening factor must take extremely low values at the first iterations 
and a prior selection of the sequence is not straightforward. 

In this work we propose a simple technique that consists in clipping the 
standard weights that are above a certain threshold. As a consequence, a 
sufficient number of "flat" (modified) weights is guaranteed in the regions of 
the space of theta where the standard weights were larger. The ESS becomes 
correspondingly larger as well. 

Specifically, the modified weights at iteration I are computed from the 
original importance weights as 



_(i)* 

w = mm 



( w f*,r e M ), i = i,...,M, 



where the threshold T e is selected to guarantee that the number of samples 
0^ that satisfy wf 1 * > T e M is equal to Mt < M. The parameter Mt must 
be selected to represent adequately the multidimensional target posterior pdf 
7r(0)=p(e|y)._ 

An alternative approach to the hard clipping technique described above, is 
to apply a sigmoid transformation to the standard weights, resulting in a soft 
clipping, namely 

w)' = t 7-rr - Pe, i = l,...,M, 



I + exp 



2u£ J 

0e 



where ft > should increase along the iterations in order to progressively 
reduce the nonlinear distortion of the standard weights. 



6 Convergence of nonlinear importance sampling 

The convergence of the original PMC scheme is easily justified by the conver- 
gence of the standard IS method. Indeed, it can be proved that the discrete 
measure ^{dO) = Y^iLi w t> v„ii){dO) converges to ir(6)d9 under mild as- 
sumptions, meaning that 

lim |(/,7rf)-(/,7r)| = (5) 

M— >oc 

almost surely (a.s.) for every i £ {I, L} and any / E B(R K ). 

In this section we provide a result similar to Eq. ([5]) for the discrete measure 
nf 1 generated by the NPMC algorithm using a class of clipping transforma- 
tions. The analysis, therefore, is concerned with the asymptotic performance 
of the approximation as the number of amples M grows, and not with the 
iteration of the algorithm, i.e., not with the convergence as I increases. Hence, 
we shall drop the latter subscript for convenience in the sequel. 
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6.1 Notation and basic assumptions 

Let 7r be the pdf associated to the target probability distribution to be ap- 
proximated and let q be the importance function used to propose samples in 
an IS scheme (not necessarily normalized) and let h(9) = an(9) be a func- 
tion proportional to n, with the proportionality constant a independent of 9. 
The samples drawn from the distribution associated to q are denoted 8^', 
i = 1, M, while their associated unnormalized importance weights are 

i = L...,M. 

,(«<•>)' 

Let us define the weight function g = h/q, i.e., g{9) = h{0)/q(6) and, in 
particular, g(9^) = w 1 -^* . The support of g is the same as the support of 
q, denoted § C M. K . If we assume that both q(6) > and ir(6) < for any 
9 e S, then g{9) > for every 9 £ § as well. Also, trivially, tt oc gq, with the 
proportionality constant independent of 9. These assumptions are standard 
for classical IS. 

The approximation of the target probability measure generated by the 
standard IS method is 

M 

7T M (d0) = X> ( %t*>W> 
i=l 

where 



w v 



is the i-th normalized standard weight. 

The nonlinear transformation ip M of the weights is assumed to be of a 
clipping class. In particular, let %m be a permutation of the indices in 

{1, ...,M} such that < w^* < ... < w^ 1 ^* and choose M T < M and 

some constant C < oo. Then, the transformed weights satisfy 



w^* = ip M (w^*) < C, for k = 1,...,M T , and 

=ip M( w (i k )*^ = f or k = M T + 1, . . . , M 



(G) 



The condition above is can be satisfied for various forms of hard and soft 
clipping when the weight function g is upper bounded. In particular, if g 6 
B(K K ) then, trivially, w^* < \\g\\oo f° r either the hard or the soft clipping 
procedures described in Section 15. 3.21 fhence. C < ||g||oo)- 

The approximation of the target probability measure generated by the 
nonlinear IS method is 

M 
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where 

= y"(g(fl (<) )) 



is the i-th normalized transformed weight. Additionally, we introduce a non 
normalized approximation of n(6)d0 of the form 



M 

TT M (d8) = Y / ™ {t) ^(dO), (7) 

i=l 

where 

are non normalized transformed weights that will be referred to as "bridge 
weights" in the sequel. 



6.2 Asymptotic convergence 

We aim at proving that limM->oo \(f, k m ) — (/, 7r)| =0 a.s. for any / £ B(R K ). 
To obtain such a result, we split the problem into simpler questions by applying 
the triangle inequality 

|(/,7f M ) - (/,tt)| < \(f,n M ) - (f,n M )\ + |(/,7r M ) - (/,tt)|. (8) 

The second term on the right hand side of © is handled easily using standard 
IS theory. For the first term, we have to prove that the discrete measure 
generated by the nonlinear IS method (tt m ) converges to the discrete measure 
generated by the standard IS method (tt m ). This can be done by resorting to 
another triangle inequality, 

\(f,n M ) - (f,n M )\ < |(/,7f M ) - (f,7T M )\ + |(/,7T M ) - (/,tt m )|, (9) 

that reveals the role of the bridge measure defined in Q. 

The following lemma establishes the asymptotic convergence of the term 
(/,vf M ) - (/,tt m )| in inequality ©. 

Lemma 1 Assume that limM-^oo ^Tp = 0, g £ B(R K ), g > and the trans- 
formation function (p M satisfies Then, for every f £ B(M. K ) and suffi- 
ciently large M , there exist positive constants ci, c[ independent of M and Mt 
such that 

p||(/,7f M )-(/,# M )| <ci^| >1- cxp (-c' x M). 

Proof: See Appendix [A"l 

Next, we establish the convergence of the bridge measure tt m toward ir M . 
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Lemma 2 Assume that g € B(M. K ), g > and the transformation function 
tp M satisfies ftjjty. Then, for every f £ B(M. K ) there exist positive constants 
C2 , c' 2 independent of M and Mt such that 

P { |(/, * M ) (/, O I < ] > 1 - exp (-4M). 



M J 

Proof: See Appendix [Bl 

The combination of Lemmas Q] and [21 together with the triangle inequality 
©, yields the convergence of the error |(/, tt m ) — (/, ir M )\- 

Lemma 3 Assume that limM-*.oo -jrf- = 0, g G -B(R K ), g > and i/ie trans- 
formation function ip M satisfies Q). Then, for every f G -B(M^) 7 and siijffi- 
ciently large M , there exist positive constants c, c' independent of M and Mt 
such that 

P { | (/, k m ) - (f, tt m ) | < c^- 1 > 1 - 2 exp (-c'M). 



In particular. 



lim |(/,7f M )-(/,7r M )| = a. S . 

M— voo 



Proof: See Appendix [C] 

Finally, Lemma [3] can be combined with inequality ([5]) to yield the desired 
result, stated below. 



Theorem 1 Assume that g 6 B(M. K ) and the transformation function ip M 
satisfies ^ with 
every f E B{R K ) 



satisfies with C < \\g\\oo independent of M . 7/limM->oo = then, for 



lim \(f,7T M )~(f,n)\ =0 a.s. 

Proof: It is classical result that 

lim \(f,ir M )-(f,ir)\ = a.s. (10) 

Af-i-oo 

Combining (I10[) with the second part of Lemma |3] and the triangle inequality 
in (JSJ) yields the desired result. □ 



7 Example 1: a Gaussian mixture model 

In this section we provide numerical results that illustrate the degeneracy 
problem and the performance of the proposed NPMC scheme applied to the 
Gaussian mixture model example of [7]. 
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7.1 Problem setup 

We consider the Gaussian mixture model given by 

p(y\0) = pAf (y; 6^ a 2 ) + (1 - p)M (y; 9 2 , a 2 ) (11) 

where the variable of interest = [9i, 6y T has dimension K = 2, and contains 
the means of the mixture components. The true values of the unknowns are 
set to 6 = [0, 2] T . The mixture coefficient and the variance of the components 
are assumed to be known and set to p = 0.2 and er — 1. 

We assume a prior pdf p(6) = p{0\)p{92) composed of equal independent 
components for each unknown, given by p{0k) = A/" (0k;v, c 2 /A), for fc = 1,2. 
The hyperparameters are also assumed to be known and set to v = 1 and 
A = 0.1. 

We consider that a set y of N iid scalar observations drawn from the 
mixture model in equation (|11[) is available. We thus aim at approximating 
the target density tt(9) — p(6\y) cx p(y\0)p(0) , i.e., the posterior of 6 given 
the set of observations y. 

7.2 Degeneracy of the Importance weights 

In this section we analyze the degeneracy problem, addressed in Section 14.31 
in a simple and low dimensional IS example. 

In order to illustrate the effects of degeneracy, in this section we focus on 
the standard importance sampling procedure. We consider a set of M sam- 
ples M = {0 (i5 }^! drawn form the prior pdf p(0). Thus, the normalized 
importance weights are computed from the likelihood function as 

JV 

w& ocp(y|0«) = \{p{yn\e^) = 

71=1 

JV 

= [] PAA (y n - ef, a 2 ) + (1 - p)N (y n ; 9%\a 2 ) 
n=l 

For this model, we have analyzed the behavior of the maximum normalized 
importance weight maxw' 1 ' and the effective sample size M e ^ , when the 
number of observations N increases. We consider a number of observations N 
varying from 1 to 10 3 , and a number of samples M also varying from 1 to 
10 3 . For each pair of N and M we have performed P = 10 3 simulation runs, 
generating an independent vector of observations y and an independent set of 
samples 6> M in each run. 

In Figure [1] (left) the maximum importance weight averaged over P sim- 
ulation runs is represented as a function of the number of samples M and 
the number of observations N. The curves representing the extreme cases 
maxi w' ! ' = l/M (uniform weights) and maXjiuW = i (degeneracy) are also 
plotted on the graph. It can be observed that, as the number of observations, 
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Fig. 1 Evolution of the maximum importance weight (left) and the ESS (right) vs. the 
number of observations, N, and the number of samples, M. The curves corresponding to 
the best possible case(uniform weights and M effective samples) are depicted with squares. 
The curves representing extreme degeneracy (max; u>W = 1 and ESS= 1) are plotted with 
circles. The results displayed are averaged over P = 10 3 independent simulation runs. 

AT, increases, and considering a fixed number of samples M, the maximum 
importance weight moves apart from the optimal value 1/M and approaches 
degeneracy, where max^ = 1. This indicates that an increase in the num- 
ber of observations causes an increase in the fluctuations of the importance 
weights. 

On the other hand, in Figure Q] (right) the average ESS is also represented 
for several values of M and TV. The cases when the ESS is maximum, M e *' — 
M, and minimum, M e f* = 1, are plotted on the graph for reference. It can be 
observed that, as the number of observations N increases, the ESS is smaller 
for the same value of M, Actually, while the ESS always grows with M, its 
slope becomes very small for large values of N. For example, with TV = 1000 
observations and M = 1000 samples, we only obtain 1.5 effective samples on 
average. 

In Figure[2]an example of degeneracy in this simple setup is shown. A set of 
M = 200 samples has been drawn from the prior pdf and standard importance 
weights have been computed from the likelihood. A subset of these samples 
is depicted in figure [5] together with the associated weights. The likelihood 
function evaluated in the same region is depicted with contour lines and it 
presents high variations due to the high number of observations. It can be 
observed that only a small fraction of the sample set is close to the region 
where the likelihood is significant. As a result, one sample has a weight close 
to 1 and the rest of them have negligible weights. 



7.3 Comparison of algorithms 

In this section we compare, by way of computer simulations, the performance 
of the two proposed schemes (NPMC with tempering and with clipping, see 
Section [5]) and the standard PMC algorithm when applied to the Gaussian 
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Fig. 2 Subset of M = 200 samples flW drawn form the prior p(0) (blue circles) and the 
associated standard importance weights (red circles with size proportional to the weight 
uiW). The likelihood function p(y\0) is depicted with contour lines. 



Table 4 Standard PMC algorithm of [J]. 

Initialization (£ = 0): 

1. Consider a set of p scales (variances) Vj and an initial number rj = m of samples per 
scale, j = 1, . . . , p. 

2. For i = 1, . . . , M = pro, draw jfl^ } from q o (0) = p(0). 



Iteration (t = 1, ... ,L): 
1. For j = 1, . . . , p 

— generate a sample 1 0^ \ of size rj from 



[ (0)=M(ef ) ;0fl 1 ,v j I K ) , i = l,...,M, 



compute the normalized weights 



V 




pi 




1i 




) 



2. Resample with replacement the set < ^ > according to the weights w ( to obtain 

{»?>}-. 

3. For j = 1, . . . ,p update rj as the number of elements generated with variance Vj which 
have been resampled. 



mixture model. For convenience, we reproduce in Table 0] the original PMC 
scheme proposed in [7]. 

In order to asses the merit of an unweighted sample set }^i for the 
estimation of a scalar variable of interest 9k , we evaluate the mean square error 
(MSE) 

1 M 2 
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mean NESS 


std NESS 


mean MSEi 


mean MSE2 


std MSE l 


std MSE 2 


PMC 


0.131 


0.063 


0.037 


4.5 X 10" 3 


135.4 x 10" 3 


591.1 x 10" 6 


NPMC temp 


0.937 


0.059 


0.019 


3.3 x 10~ 3 


0.19 x 10~ 3 


5.71 x 10~ 6 


NPMC clip 


0.937 


0.060 


0.019 


3.3 x 10~ 3 


0.19 x 10" 3 


5.68 x 10" 6 


True posterior 






0.019 


3.3 x 10~ 3 


0.18 x 10~ 3 


5.53 x 10~ 6 



Table 5 Mean and standard deviation of the NESS, MSE of 61 and MSE of 8 2 for PMC, 
NPMC-temp and NPMC-clip. 



where M is the sample size and k £ {1, ...,K}. 

The parameters common to all the algorithms have been set to M = 200 
samples per iteration and L = 20 iterations. We have performed P = 10 3 inde- 
pendent simulation runs, with an independent random vector of observations 
y at each run. The standard PMC algorithm and the proposed NPMC tech- 
nique, both with tempering and clipping (labeled NPMC temp and NPMC 
clip, respectively, in the figures) have been run with the same observation 
vector y and the results have been compared in terms of the NESS and the 
MSE. 

The specific parameters of the plain PMC algorithm have been selected 
as suggested in [7] (p — 5 scales, v = [5, 2, 0.1, 0.05, 0.01] T , m = 40 samples 
per scale). A minimum of 1% of M of samples per scale has been kept as a 
baseline. 

In the NPMC algorithm with tempering, the sequence of parameters 7^, 
has been obtained from the sigmoid function of the iteration index, namely, 

7*= 1 + e - (< - B) , e = i,~.,L. 

With this choice of nonlinearity, the transformation of the weights is practically 
eliminated after 10 iterations and the algorithm performs standard IS. 

The NPMC algorithm with clipping has been simulated in its modified 
version, i.e., with the nonlinear transformation removed when the ESS reaches 
a value of = 100. In this problem, the simulations show that this occurs 

on average at the iteration I = 5. The parameter Mr has been set to Mt = 
M /4 = 50 samples. 

In Figures |3] and U] numerical results comparing the performance of the 
three algorithms (PMC, NPMC with tempering and NPMC with clipping) 
are displayed. In Figure |3] the evolution of the average NESS along the itera- 
tions is depicted. It can be observed that the original PMC scheme presents 
a low NESS, converging to a value o^| 0.131. As a consequence of the degen- 
eracy problem, the estimates of the variables of interest obtained from this 
reduced set of representative samples are often poor, presenting its variance 
large fluctuations along the iterations. Since the proposal density is updated 
based on the set of samples at the previous iteration, the update procedure 



2 Note that the NESS ranges between and 1. 
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PMC 

PMC temp 

PMC clip 



10 
Iteratipns 



Fig. 3 Evolution of the NESS for the standard PMC, NPMC with tempering and NPMC 
with clipping. 





Post 




PMC 




NPMC lemp 




NPMC clip 












Fig. 4 Evolution of the MSE of parameters 8\ (left) and 82 (right) for the different PMC 
schemes. 



also suffers from the lack of useful samples, leading to a very "noisy" conver- 
gence of the algorithm. On the contrary, the two NPMC schemes provide a 
smooth convergence of the NESS to a value of 0.937 in about 10 iterations. 

The degeneracy problem is most critical at the first iterations of the PMC, 
when the proposal pdf does not usually fit the target density. It is thus ex- 
tremely hard to obtain a set of representative samples (with non-negligible 
importance weights), which allow for a consistent and robust proposal update. 
In Figure [31 the starting value of the NESS corresponds to Mt/M = 0.25 for 
NPMC with clipping, 0.08 for NPMC with tempering and around 0.01 for the 
standard PMC algorithm. 

The mean and standard deviation of the NESS at the last iteration are rep- 
resented in Table[S]for all three algorithms. It can be checked that they present 
similar standard deviations, while the standard PMC presents a considerably 
lower mean NESS than the proposed methods. 

In Figure [4] the evolution along the iterations of the MSE for 9\ (left) 
and 62 (right) is represented for the three algorithms. The minimum MSE of 
each parameter, which has been numerically approximated from the posterior 
pdf p{Q\y), is also shown for reference. Again, it can be observed that the 
proposed NPMC schemes converge smoothly, reaching the minimum MSE in 
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a low number of iterations [I = 6). On the other side, the original PMC does 
not reach the minimum MSE with the given number of samples M = 200. 

However, the most outstanding difference in the performance of the three 
algorithms is the extremely high variance of the MSE of the original PMC 
scheme. Again due to degeneracy, this algorithm presents a highly varying 
performance along the iterations and the simulation runs. On the contrary, the 
proposed schemes reach the minimum MSE, both in average and in standard 
deviation. The final mean values, as well as the standard deviation of the MSE 
for 6*i and 6*2 are also shown in Table [SJ 

Moreover, the original PMC scheme hardly presents any adaptation along 
most of the iterations and it allows for no parameterization of the convergence 
speed. It reaches the final solution at the second iteration, which is hardly 
improved for the rest of the simulation. This behavior is not desirable in an 
iterative method, and it suggests that the adaptation is not performed in a 
appropriate way. On the contrary, the convergence speed of the NPMC method 
with clipping may be adjusted by jointly selecting the parameter^ M and Mt, 
and that of NPMC with tempering, by modifying the sequence 7^. 



8 Example 2: A stochastic kinetic model 

In this section, the proposed nonlinear PMC method is applied to the prob- 
lem of estimating the hidden parameters of a simple stochastic kinetic model 
(SKM), known as the predator-prey model. A SKM is a multivariate continuous- 
time jump process modeling the interactions among molecules, or species, 
that take place in chemical reaction networks of biochemical and cellular sys- 
tems [31121]. 

Several MCMC schemes have been recently proposed to address the prob- 
lem of estimating the rate parameters in SKMs, [3], [22], [16], [12]. In [16] 
the authors propose an approximation of the likelihood based on the moment 
closure approximation of the underlying stochastic process. In [12j a likelihood- 
free particle MCMC scheme is applied to this problem. A recently proposed 
work addressing this kind of problems is [T], where several MCMC schemes 
are proposed to address likelihood-free scenarios. 

We propose an alternative approach based on the NPMC method. In par- 
ticular, we introduce a modification of the generic NPMC algorithm which 
computes an approximation of the likelihood via a particle filter. The proposed 
method is exact in the sense that it simulates from the stochastic process in 
the model without further approximations. 



3 Selecting a low value of My leads to a faster convergence to a worse estimate. On the 
contrary, an increase in the value of Mt slows down the convergence but the final estimate 
has a smaller error. We have numerically observed that a value for the ratio Mf/M between 
0.2 and 0.7 leads to good results. 



Population Monte Carlo with transformed weights 



23 



8.1 Predator- prey model 

The Lotka-Volterra, or predator-prey, model is a simple SKM that describes 
the time evolution of two species x\ (prey) and xi (predator), by means of 
K = 3 reaction equations: [3"ll2"U] 

x\ — ^ 2xi prey reproduction 

x\ + X2 2x2 predator reproduction 

X2 — ^> predator death 

The fe-th reaction takes place stochastically according to its instantaneous 
rate a k (t) = 9 k g k ( x (£))i where 9 k > is the constant (yet random) rate 
parameter and g k {-) is a continuous function of the current state of the sys- 
tem x(t) = [xi(t),X2(t)] T . We denote by xi(t), X2(t) the nonnegative, integer 
population of each species at continuous time t. In this simple example, the 
instantaneous rates are of the form 

a x (t) = B lXl {t), o a (t) = 9 2 x 1 {t)x 2 {t) ) a 3 {t) = 9 3 x 2 (t). 

The waiting time to the next reaction is exponentially distributed with 
parameter ao(t) — ^2 k= iQ>k(t), an d the probability of each reaction type is 
given by a k (t)/ao(t). We denote by x the vector containing the population of 
each species at the occurrence time of each reaction in a time interval t £ [0, T] , 

i.e., 

x= [x T (< 1 ),x T (i 2 ),...,x T (^)] T , 

where R is the total number of reactions and U G [0,T], i = 1,...,_R, are 
the time instants at which the reactions occur. Assuming that the entire 
vector x is observed, the likelihood function for the rate parameter vector 
9 = [6x, • ■ ■ , 9k] T can be computed analytically [21], 

p(x|0) = f[p(x\9 k ) = T[9 r k «ex P \-9 k F g k (x(t)) dt\ , (12) 
k=i k=i I Jo J 

where r k is the total number of reactions of type k occurred in the time interval 
[0,T]. 

The structure of the likelihood function in (fT2|) allows the selection of a 
conjugate prior distribution for the rate parameters in the form of independent 
Gamma components, i.e., 

K K 

p(p) = n P (9 k ) = n G(e k ; o fc , b k ), (13) 

fe=l k=l 

where a k ,b k > are the scale and shape parameters of each component, 
respectively. Thus, the posterior distribution p(6\x) = Jlfc=i P(^fel x ) may be 
also factorized into a set of independent Gamma components 

p(6 k \x) = Q (e k ; a k +r k ,b k + J g k (x(t))dtj, 
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Therefore, in the complete-data scenario (in which x is observed), exact in- 
ference may be done for each rate constant 9k separately. However, making 
inference for complex, high-dimensional and discretely observed SKMs (where 
x is not fully available) is a challenging problem [3J. 

Exact stochastic simulation of generic SKMs, and predator-prey models in 
particular, can be carried out by the Gillespie algorithm This procedure 
allows to draw samples from the prior pdf of the populations, p(x.\8), for 
arbitrary SKMs. 

Although in this work we restrict ourselves to the predator-prey model, 
which is a simple but representative example of SKM, the proposed algorithm 
may be equally applied to the estimation of parameters of SKMs of higher 
complexity. 



8.2 NPMC algorithm for SKMs 

We assume that a set of N noisy observations of the populations of both 
species are collected at regular time intervals of length A, that is, y„ = x„ + 
u„, n = 1, . . . ,N, where x„ = [xi(nA), £ 2 (nZ\)] T and u n is a Gaussian noise 
component with zero mean vector and covariance matrix a 2 l. We denote the 
complete vector of observations as y = [yj , • • • , yJj] T with dimension 2N x 1. 
Thus, the likelihood of the populations x is given by 



The goal is to approximate the posterior distribution, with density p(9\y) oc 
p(y\9)p(9) , given the prior pdf p(9) and the likelihood p(y\9), using the NPMC 
method. 

In this particular problem, the observations y are related to the variables 
9 through the random vector x. Indeed, the likelihood of 9 has the form 



where £w x |0)[ - ] denotes expectation with respect to the pdf in the subscript, 
and p(y|x, 9) — p(y|x), since the observations are independent of the param- 
eters 9 given the population vector x. As a consequence, the likelihood of the 
parameters cannot be evaluated exactly. A set of likelihood-free techniques 
have been recently proposed to tackle this kind of problems, which avoid the 
need to evaluate the likelihood function pQ. However, in this work we follow 
a different approach which consists in computing an approximation of the 
likelihood of p{y\9). 

In principle, it is possible to approximate the integral in (|14p as an average 
of the likelihoods p(y|x^) of a set {xW}|_ x of exact Monte Carlo samples 



N 



p(y|x) = JJ^(yni 





(14) 



Population Monte Carlo with transformed weights 



25 



from the density p(x\8), drawn using the Gillespie algorithm, that is, 



This approach, however, is computationally intractable, because it demands 
drawing a huge number of samples / to obtain a useful approximation of the 
likelihood p(y\8), since the probability of generating a trajectory of popula- 
tions similar to the observations is extremely low. 

To overcome this difficulty, we propose a simple approach based on using a 
standard particle filter. An approximation for the likelihood of the parameters 
p(y | 8^') may be recursively obtained based on the particle filter approxima- 
tion of the posterior of the populations p(x | y, 8^). Indeed, the likelihood of 
a sample 8^' can be recursively factorized as 

N 

p(y|0 (i) ) = lb (ynlyi : »-i,0 (i) 

n=l 



where each of the terms in the product (with fixed 8 = 8^) can be approxi- 
mated via particle filtering (see, e.g., [8] or [5])- 

The NPMC method for this application is exact in the sense that it uses 
exact samples x'*' from the stochastic model generated via the Gillespie al- 
gorithm, and does not perform any approximation of the stochastic process. 
Alternative approaches have been proposed which employ a diffusion approxi- 
mation. However, this approximation has not shown to introduce any improve- 
ment to our algorithm. 

To summarize, the standard importance weights for this application have 
the form 

p(y\df)p(8f 



(*)* 



(i) 



where the likelihood p(y\8g) is approximated using a particle filter. The prior 
is a product of K independent Gamma components as in Eq. (|13[) and the 
proposal is a multivariate normal pdf, as shown in Tabled] 

Let us finally remark that, while in this work we have focused on the 
complete-observation scenario, the proposed method is also valid in the par- 
tially observed case, when only a subset of the species can be observed. 



8.3 Computer simulation results 

We have applied the NPMC algorithm to the problem of estimating the pos- 
terior pdf of the constant rate parameters vector 8 in a simple predator-prey 
model. The true vector of parameter rates which we aim to estimate has been 
set to 

8 = [0.5, 0.0025, 0.3] T . 
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A realization of the populations x has been generated from the prior dis- 
tribution p(x|0) with initial populations x(0) = [7f,79] T , and a total length 
of T = 40. The observation vector y has been generated with an observation 
period A = 1 and a Gaussian noise variance a 2 = 100. 

Figure [5] depicts the time evolution of the true populations of both species 
x, and the corresponding discrete-time noisy observations y. The autoregula- 
tory behavior of the model can be clearly observed on the graph. 



600 




10 20 30 40 

Time 

Fig. 5 Real populations (x) and discrete-time noisy observations (y) in a predator-prey 
model. 

The parameters of the prior Gamma distribution p(9) have been chosen 
such that the corresponding mean matches the real value of 6 and the marginal 
standard deviations are [1.25, 0.0065, 0.77] T . These priors cover an order of 
magnitude either side of the true value, which represents a vague prior knowl- 
edge. 

The target density for the NPMC algorithm is the posterior pdf, tt(6) = 
p(9\y). Despite the low dimension of this problem (K = 3), the importance 
weights of the PMC scheme present severe degeneracy, partially due to the 
likelihood approximation. In the simulations, we have observed that the im- 
portance weights present such degeneracy, that, for the NPMC scheme with 
tempering, the coefficient j£ must take extremely low values (about 10 -4 ) at 
the first iterations in order to balance the weights and obtain a reasonable 
number of effective samples. Besides, the sequence is hard to design a pri- 
ori, and it does not guarantee that the resulting sample set leads to accurate 
estimates. For that reasons, in this example we have restricted our attention 
to the hard clipping technique, which does not require the fitting of any pa- 
rameters and guarantees a baseline ESS. Recall that the NPMC scheme with 
hard clipping only requires the selection of a unique parameter Mr, which 
allows to adjust the convergence performance of the algorithm, and does not 
require a precise fitting. In this example, we have chosen My = M/5, with a 
total number of samples M — 500 and L = 10 iterations. 

In order to be able to compare the MSE of parameters 9k, k — 1,2,3, that 
take very different values, we additionally define normalized MSEs (NMSE) 
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Fig. 6 Left: Average value of log 10 (NMSE) versus the NESS after the last iteration (L = 10) 
for each simulation run, with M = 500 samples. The histogram of each variable is also 
represented. Right: Average value of log 10 (NMSE) versus the NESS after the last iteration 
(L = 10) for each simulation run, with M = 500 for 70% of the simulation runs, M = 1000 
for the 28% of the simulation runs and M = 2000 for 8% of the simulation runs. 



and a global error figure as the average NMSE, namely 

1 K 

NMSE = — V NMSE k . 
K — ' 

fe=i 

We have carried out two different experiments. In both experiments we 
have performed P — 100 independent simulation runs, with the same true 
parameters and initial populations x(0), but different (independent) popu- 
lation and observations vectors x and y in each run. 

In the first experiment we considered the same number of samples M = 500 
in all the simulation runs of the NPMC algorithm and computed the NESS 
and the NMSE along the iterations. We observed that a 70% of the simulation 
runs converged to yield accurate estimates (with low NMSE), while in a 30% of 
the cases, the algorithm produced biased estimates, with considerably higher 
NMSE and low final NESS (below a 0.3). 

In Figure [5] (left) the final values of the logarithm of the NMSE versus 
the NESS obtained for each simulation run are depicted, together with the 
histograms of both the log 10 (NMSE) and the NESS. This plot reveals how low 
NESS values correspond to high NMSE, and viceversa. This empirical relation- 
ship is relevant because allows to detect, in practice, whether the algorithm 
has converged properly or not. 

The second experiment consisted in repeating the simulation of the algo- 
rithm in those cases which did not yield accurate estimates with M = 500 
samples, but increasing the number of samples to M = 1000. For this new 
experiment, we considered that the algorithm had converged when the final 
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Fig. 7 Left: Evolution of the mean NESS, before and after clipping. The NESS before 
clipping (computed for the standard weights) is labeled NESS orig. Right: Evolution of the 
mean MSE. The lower bound is obtained from the optimal posterior p(0|x) (labeled Opt 
post in the figure). 
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MSE 2 


MSE 3 
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NMSE 2 


NMSE 3 


mean NMSE 


Prior 


1.56 


4.42 x 10~ 5 


0.5929 


6.25 


6.76 


6.588 


6.533 


NPMC clip (1) 


9.7 x 1CT 3 


53.0 x 10" 9 


36.8 x 10" 4 


0.03878 


0.008484 


0.04089 


0.02938 


NPMC clip (2) 


1.4 x icr 3 


23.46 x 10 -9 


6.38 x 10~ 4 


0.005618 


0.003754 


0.007092 


0.005488 


Optimal posterior 


0.26 x icr 3 


5.53 x 10" 9 


0.881 x 10" 4 


0.00103 


0.0008862 


0.0009796 


0.0009654 



Table 6 Final mean MSE and NMSE for the parameters 0\, 02 and 03 in experiments 1 
and 2. The prior and optimal posterior values are included for comparison. 



NESS was greater than 0.3. Out of the repeated simulations, 8 cases still did 
not converge and the simulations were repeated with M — 2000. Finally, all 
the simulation runs obtained a final NESS> 0.3 with a number of particles 
M = 500 for the 70% of the cases, M = 1000 for the 22% of the cases and 
M = 2000 for the 8% of the cases. 

In Figure H] (right) the final values of log 10 (NMSE) versus the NESS are 
plotted after repeating the simulations with M = 1000 (22% of the runs) and 
M = 2000 (8% of the runs). It can be observed that the points with high 
NMSE are removed and that most points concentrate in the region of high 
NESS and low NMSE. Also the histograms of NMSE and NESS show how 
their mean values significantly improve. 

Figure shows the evolution of the mean NESS and NMSE along the 
iterations in both experiments. In solid blue lines, the results regarding the first 
experiment are plotted (M = 500 for all simulation runs). In solid red lines, 
the results of the second experiment are represented (M = 500, M — 1000 
and M = 2000). 

In Figure [7] (left) the evolution of the NESS, M" e// , after the clipping 
procedure in both experiments is represented. In both cases it starts from 
an initial value of Mt/M and reaches a final value of about 0.43 in the first 
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Fig. 8 Example of pdfs of interest of a particular simulation run with final NMSE of 0.0053 
(close to the average NMSE): marginal priors p(0fc), optimal posteriors p(0fc|x), estimated 
posteriors p(0fc|y) and true values 8^ for k = 1 (left), k = 2 (central) and k = 3 (right). 



case and 0.5 in the second cas^j|. Also the NESS before clipping (i.e., for the 
standard weights) is represented in this graph, which takes significantly lower 
values, but presents a significant increase after the first iterations. In this 
example, the clipping procedure has been performed along all the iterations 
because of the severe degeneracy of the weights. This is probably due to the 
approximation in the computation of the likelihood p{y\0), which leads to 
additional fluctuations in the values of the importance weights, increasing the 
effects of degeneracy. 

Figure [7] (right) shows the evolution of the NMSE along the iterations, as 
well as the corresponding lower bound given by the optimal marginal posterior 
with complete data, p(6k\x), in both experiments. The starting value at I = 
corresponds to the vague knowledge described by the prior distribution. It 
can be seen that the NMSE of the NPMC algorithms with clipping smoothly 
decreases until a final value close to the lower bound, in a low number of 
iterations. 

Table [5] summarizes the obtained results in terms of MSE and NMSE for 
each parameter 6^ for both experiments. It can be observed that the prior 
distribution provides vague knowledge of the variables of interest, with a high 
MSE and NMSE. The NMSE allows to compare the error values of the three 
parameters, even when 02 has a very low value with respect to 9\ and #3. 

In the first experiment we obtain a remarkable reduction in MSE and 
NMSE with respect to the prior. However, increasing the number of samples 
in a fraction of cases allows to further reduce the estimation error. The MSE 
and NMSE of the optimal marginal posteriors are also shown for comparison. 
Note, however, that this is an unreachable performance bound, since it is 
obtained when the whole vector x is observed. 

In order to illustrate the quality of the obtained approximations, Figure 
[5] depicts the final estimate of the marginal posteriors p(0k\y), together with 
the corresponding prior p(9k) and the optimal posterior in the complete-data 
scenario p(6>fc|x). It can be observed that the algorithm provides a good ap- 
proximation of the 

4 Note that the NESS increases beyond the effect of the clipping procedure, which in- 
dicates that the iterative scheme is able to generate more representative samples as the 
algorithm converges. 
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9 Conclusion 

Wc have addressed the problem of approximating posterior probability distri- 
butions by means of random samples. A recently proposed approach to tackle 
this problem is the population Monte Carlo method, that consists in iteratively 
approximating a target distribution via an importance sampling scheme. The 
main limitation of this algorithm is that is presents severe degeneracy of the 
importance weights as the dimension of the model, K, and/or the number of 
observations, N, increase. This leads to a highly varying number of effective 
samples and inaccurate estimates, unless the number of samples is extremely 
high (which makes the method computationally prohibitive). 

We propose to apply a simple procedure in order to guarantee a prescribed 
ESS and a smooth and robust convergence. It consists in applying nonlinear 
transformations to the standard importance weights in order to reduce their 
fluctuations and thus avoid degeneracy. In order to illustrate the application 
of the proposed technique, we have applied it to two examples of different 
complexity. The first example is a simple GMM, which allows to get insight of 
the performance of the standard PMC scheme, the degeneracy problem and 
the behavior of the proposed algorithm. 

We have tackled the problem of estimating the set of constant (and ran- 
dom) rate parameters of a stochastic kinetic model. Even for the relatively 
simple predator-prey model that we have studied, this is significantly more 
complex than the GMM example. We have shown extensive simulation results 
that show how the proposed NPMC scheme can greatly improve the perfor- 
mance of the standard method. 

The convergence of standard PMC algorithms is often justified by the 
asymptotic convergence of IS (with respect to the number of samples). The 
NPMC scheme modifies the importance weights and, hence, the standard the- 
ory of IS cannot be applied directly. To address this difficulty, we have an- 
alyzed the convergence of the approximations of integrals computed using 
transformed importance weights and proved that they converge a.s., similar 
to the results available for standard IS. 
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been partially supported by Ministerio de Economia y Competitividad of Spain (program 
Consolider-Ingenio 2010 CSD2008-00010 COMONSENS and project DEIPRO TEC2009- 
14504-C02-01). 



A Proof of Lemma [T] 

As a first step, we seek a tractable upper bound for the difference 
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(15) 
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where 



s (i) = (y M °g)(9 (0 ) 
Etii(^ M °9)(e (j) ) 

" E"i9(0 (3) ) ' 



and 



(16) 
(17) 



and (tfi M og)(0) = ip M (g(0)) denotes the composition of the functions ip M and g. Moreover, 
the constants in the denominators of the weights can be written as integrals with respect to 
the random measure 

1 M 

q M (M) = -J2SgU)(de), (18) 



3=1 



namely, 



and 



'£( l p M og)(eW)=M( V M o g , q M ) 

j = l 



(19) 



J2a(0 U) ) = M(g,q M ). (20) 

3=1 

Substituting H16> . ]17> . H19I I and 12U> . into I I15H yields, after straightforward manipulations, 



(/,7f M )-(/,7f M ) 



J- y ^b.^)-^'^) 

M 2_^ > (v m om m )(m m) 



(21) 



A useful upper bound for the difference of integrals follows quite easily from i'2\li . In partic- 
ular, note that |/(0 W )(<£ M o g)\ < ||/j| oo ||s||oo, since /, g e B(R K ) and v? M o g < g, while 
the latter inequality also implies that (ip M o g, q M )(g, q M ) > (f M ° <7, q M ) 2 - Also note that, 
from the definition of ip M , 



\(g,q M ) - (v> M og,q M )\ < JL g _ o S )(0C**))| 

fc — 1 

2M T ||g|| 00 



< 



M 



As a result, we obtain 



\ lf -Ms u ~Mn| ^ 211/HooHgHgoMT 
Let ci > be some arbitrary real constant. From (1 23 It . 



(/,* M )-(/,* M ) < 



Mt 
~M 



C1 } > 



43 _ 

{M(ip M og,q M ) 2 ~ M 



lloollffllloMT ^ M T 



If we choose 



1 l_ 1 



(9,q) 2 



where 1 < a < v2 : then substituting d 2 5 1) into the right hand side of i !24[ i yields 



\ A% Af o g,q M ) 2 ~ M 01 



& M og,q M f> 



1 1 

a V2 



(<p M °g,q M )- -(g,q) >*(g,q) 
a V2 



M( V M o g, q™ )---r-(g,q)>-^( g , q )j, (26) 
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where the second equality holds because i — > 0. 

Next, consider the expectations E q [((p M og,q M )] and q M )] = (g, q). Since, |(ip M o 

9:9 M ) - (a,1 M )\ < 2MT||ff|[oo/M (see Eq. J22J0, it follows that 

^[(^Qff,,")] -£ ? [(g,g M )]| < £ q [|( V M o fl ,^)-( s ,g M )|] 

2Af T ||g||oc 

- M ' 

Therefore, since we have assumed that limM->oo = and g > 0, there exists M a such 
that, for all M > M a , 

E q [( V M og,q M )] > X -E q [(g,q M )] = ±(g,q), (27) 
and combining Eq. (I2(it with the inequality II27I I we obtain that 

m f all/ll^llgllgc^T < Mr 



\ M(<p M o g, q M ) 2 ~ M Ci J ~ 
P{M(^ M o g, q M ) - ME q [( 9 M o g, q M )] > -lL(g, g)} . (28) 

Since M{<p M o g, g M ) = ip M (g{0^)) is the sum of M independent and bounded 

random variables, each of them taking values within the interval (0, ||g||oo), it is straightfor- 
ward to apply Hoeffding's tail inequality 1 131 to obtain a lower bound on II28II . namely 

W^M{ V M og, q M )-ME q [(<p M og, q M )] > -^(g, g)} > 1 - exp {-^--M.} . (29) 

Substituting H29l back into 128 1 . J26l) and H24\i yields the desired result, 

p||(/,7f M )-(/^ M )| < > 1-expj-ciM}, (30) 

with 4 = fefil. 



B Proof of Lemma H 

The argument is similar to that of the proof of Lemmaf^ Recalling Eqs. 1171 . 1181 and (1206 
in Appendix fS] as well as the form of the standard normalized weights, 



M(g,q M )' 



it is straightforward to show that 



k/,* M )-c/> M ) 



i 



M(g,q M ) 
which readily yields the upper bound 



(/,* M )- (/,^ M ) < 



M( S ,gM) 



(31) 
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Let C2 > be some arbitrary real constant. From (1316 . 

W ' ; w ' ; l - Af J" \ M{g,q M ) ~ M 
and if we choose 

211/llooNloo 

C2 



then 

f2 H%i%^ < ^ C2 \ = p/^.M) > A _ 1 ) , 

M( ff ,g*r) " Af 2 J \ vy,y 7 "V v^/ 

Af( g) g M )-M(s ig )>-^(g )g )}. (33) 

Since (g,q) = i?q [(g, q M )\ and (g,q M ) is the sum of M independent, and bounded, 
random variables taking values within the interval (0, ||g||oo] (recall that g > 0), we can 
readily apply Hoeffding's tail inequality 1131 on Eq. 1331 1 to obtain 

M(g,q M )-M(g,q) > ~(g,q)\ > 1 - exp I-^^-mX . (34) 



%/2 J ~ I W'jWL 

Substituting (1341 ) back into II33H and ]32l l yields the desired result, 

p{|(/,* M )-(/ )7 r M )| < ffc 2 } >l-exp{-4M}, 



where c 2 = fejlj- > 0. 



C Proof of Lemma [3] 



The first part of Lemma [3] follows from the combination of Lemmas ^ and [2] We first note 
that, from Lemma [T] 

P<[|(/,* M ) " Cf,* M )| > a^} < exp{-4M} (35) 
for sufficiently large M, while Lemma [2] implies 

P{|(/,* M )-C/> M )| >c 2 ^} <cxp{-c 2 M} 
where C2 = c 2 = (g, 9) 2 /IMIoo- Let c = C\ + C2. Then, since 

l(/,* M ) - C/> M )| < l(/,* M ) - U,* M )\ + l(/,* M ) - 0> M )I, 

we trivially obtain that 



(36) 



| (/, ) - (/, ) | > C^- | < P | | (/, ff" ) - (/, ) | + | (/, ) - (/, 7T- ) | > C ^ 

(37) 

However, if 

| \J ' )-(/. 7r )| + |(/- 7r )-(/- 7r • ) | >C ~^7" 
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is true, then 



|(/,*^- ( /,^)|>d^ or | (/ ,*M ) _ (/)7r M ) | >C2 ^: i 



or both jointly, are true. Therefore, 



k/,ff M )-(/.* M ) + (/,*") -(/,t m )>c 



Mr ] 
M J 



(/,*") | > C1 ^ 
- {/ ,^)|> C2 ^} 

< exp {— c^M } + cxp {-c' 2 M } 
I llsllio J 

for sufficiently large M, where the second inequality follows from d 35 II and II36II . and the 

equality is due to the fact that c' ± = c' 2 - 

Combining (I3T6 and (138 It yields the first part of Lemma[3] with c' = (<7, f?) 2 /!!;?!! 2 *) ■ 
The second part of Lemma [3] follows from a standard Borel-Cantclli argument. Indeed, 

let Em be the event in which |(/, 7r M ) — (/, vr M )| > c^p- holds true. From the first part of 

the Lemma, 

P{£ M } < 2exp{-c'A/} , 

with c' > 0, for sufficiently large M (specifically, for all M > M a , with M a as in the proof 
of Lemma^J. Therefore, 

00 00 

J2 F i £ M} < M a + cxp {-c'M} < OO, 

M=l M=M a +l 

because M a < 00 and X)m=M +i ex p{— c'M} < 00. As a consequence (see, e.g., |23l 
Theorem 2.7]), P {limsup Sm } = 0, which implies that 

Jim |(/,7f M )-(/,7r M )[ = a.s. 
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